Code
```{r}
library(tidyverse)
library(haven)
asthmads <- read_sav("asthmads_clean.sav") %>%
as_factor()
asthmads
```In this session, several tests will be covered
There will also two part, on how to conduct the tests
Objective: To test whether employment status and gender had significant association
asthmads_clean.sav```{r}
library(tidyverse)
library(haven)
asthmads <- read_sav("asthmads_clean.sav") %>%
as_factor()
asthmads
``````{r}
with(asthmads, table(Gender, WorkStatus))
``` WorkStatus
Gender Unemployed Employed
Female 47 17
Male 33 53
```{r}
xtabs(~ Gender + WorkStatus, asthmads)
``` WorkStatus
Gender Unemployed Employed
Female 47 17
Male 33 53
```{r}
xtabs(~ Gender + WorkStatus, asthmads) %>%
prop.table(., margin = 2)
``` WorkStatus
Gender Unemployed Employed
Female 0.5875000 0.2428571
Male 0.4125000 0.7571429
```{r}
library(tidyverse)
library(janitor)
asthmads %>%
count(Gender, WorkStatus)
``````{r}
asthmads %>%
count(Gender, WorkStatus) %>%
pivot_wider(names_from = WorkStatus, values_from = n,
values_fill = list(n = 0))
``````{r}
asthmads %>%
count(Gender, WorkStatus) %>%
pivot_wider(names_from = WorkStatus, values_from = n,
values_fill = list(n = 0)) %>%
janitor::adorn_totals(c("row", "col"))
``````{r}
asthmads %>%
tabyl(Gender, WorkStatus)
``````{r}
asthmads %>%
tabyl(Gender, WorkStatus) %>%
adorn_percentages("col") %>%
adorn_pct_formatting()
```chisq.test(_) function```{r}
with(asthmads, table(Gender, WorkStatus)) %>%
chisq.test(., correc = F)
```
Pearson's Chi-squared test
data: .
X-squared = 18.128, df = 1, p-value = 2.066e-05
gtsummary::gtsummary:: package is my favourite package, providing an elegant and flexible way to create publication-ready analytical and summary tables.
```{r}
library(gtsummary)
asthmads %>%
select(Gender, WorkStatus) %>%
tbl_summary(by = WorkStatus)
```| Characteristic | Unemployed, N = 801 | Employed, N = 701 |
|---|---|---|
| Gender | ||
| Female | 47 (59%) | 17 (24%) |
| Male | 33 (41%) | 53 (76%) |
| 1 n (%) | ||
```{r}
asthmads %>%
select(Gender, WorkStatus) %>%
tbl_summary(by = WorkStatus) %>%
add_p()
```| Characteristic | Unemployed, N = 801 | Employed, N = 701 | p-value2 |
|---|---|---|---|
| Gender | <0.001 | ||
| Female | 47 (59%) | 17 (24%) | |
| Male | 33 (41%) | 53 (76%) | |
| 1 n (%) | |||
| 2 Pearson’s Chi-squared test | |||
Objective: to compare body height between gender
```{r}
asthmads %>%
ggplot(aes(x = Ht_m, fill = Gender)) +
geom_density(alpha = .5) +
theme_bw()
``````{r}
asthmads %>%
group_by(Gender) %>%
summarise(mean = mean(Ht_m, na.rm = T),
sd = sd(Ht_m, na.rm = T))
``````{r}
asthmads %>%
group_by(Gender) %>%
descr(Ht_m)
```Descriptive Statistics
Ht_m by Gender
Data Frame: asthmads
N: 64
Female Male
----------------- -------- --------
Mean 1.50 1.74
Std.Dev 0.11 0.09
Min 1.29 1.51
Q1 1.42 1.69
Median 1.48 1.74
Q3 1.56 1.79
Max 1.79 1.95
MAD 0.10 0.07
IQR 0.13 0.10
CV 0.07 0.05
Skewness 0.51 -0.05
SE.Skewness 0.30 0.26
Kurtosis -0.28 0.03
N.Valid 64.00 86.00
Pct.Valid 100.00 100.00
```{r}
library(rstatix)
asthmads %>%
group_by(Gender) %>%
get_summary_stats(Ht_m)
```t.test(_) function```{r}
t.test(Ht_m ~ Gender, asthmads)
```
Welch Two Sample t-test
data: Ht_m by Gender
t = -14.12, df = 119.43, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-0.2703944 -0.2038862
sample estimates:
mean in group Female mean in group Male
1.503906 1.741047
In R, many functions require a formula parameter, especially in statistical modelling.
outcome ~ predictors, data,
outcome ~ group, datagroup = categorical variable = groupsgtsummary::```{r}
library(gtsummary)
asthmads %>%
select(Gender, Ht_m) %>%
tbl_summary(by = Gender,
statistic = list(all_continuous() ~ "{mean} ({sd})"))
asthmads %>%
select(Gender, Ht_m) %>%
tbl_summary(by = Gender,
statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
add_p(test = list(all_continuous() ~ "t.test"))
```| Characteristic | Female, N = 641 | Male, N = 861 |
|---|---|---|
| Ht_m | 1.50 (0.11) | 1.74 (0.09) |
| 1 Mean (SD) | ||
| Characteristic | Female, N = 641 | Male, N = 861 | p-value2 |
|---|---|---|---|
| Ht_m | 1.50 (0.11) | 1.74 (0.09) | <0.001 |
| 1 Mean (SD) | |||
| 2 Welch Two Sample t-test | |||
```{r}
library(broom)
t.test(Ht_m ~ Gender, asthmads) %>%
tidy()
``````{r}
library(rstatix)
asthmads %>%
t_test(Ht_m ~ Gender, detailed = T)
```Objective: Find significant predictors of BMI_Changes
BMI_Changes was not available, and need to calculate
asthmads_clean.sav```{r}
asthmads
``````{r}
asthmads <- asthmads %>%
mutate(BMI_Changes = BMI_Post - BMI_Pre)
asthmads
``````{r}
slm_PA_HW <- lm(BMI_Changes ~ PA_HW, asthmads)
summary(slm_PA_HW)
```
Call:
lm(formula = BMI_Changes ~ PA_HW, data = asthmads)
Residuals:
Min 1Q Median 3Q Max
-1.22966 -0.35884 0.02248 0.35711 1.39559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.84559 0.06743 -27.37 <2e-16 ***
PA_HW -0.31476 0.01871 -16.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5182 on 148 degrees of freedom
Multiple R-squared: 0.6567, Adjusted R-squared: 0.6544
F-statistic: 283.1 on 1 and 148 DF, p-value: < 2.2e-16
In R, many functions require a formula parameter, especially in statistical modelling.
outcome ~ predictors, data,
+ operator
weight ~ height + age, my_data* operator for interaction
weight ~ height + age + height*age, my_data```{r}
plot(slm_PA_HW)
```gtsummary::```{r}
asthmads %>%
tbl_uvregression(method = lm,
y = BMI_Changes,
include = PA_HW)
```| Characteristic | N | Beta | 95% CI1 | p-value |
|---|---|---|---|---|
| Physical Activity (total hour per week) | 150 | -0.31 | -0.35, -0.28 | <0.001 |
| 1 CI = Confidence Interval | ||||
```{r}
asthmads %>%
tbl_uvregression(method = lm,
y = BMI_Changes,
include = c(Gender, Age, WorkStatus, BMI_Pre, PA_HW),
pvalue_fun = partial(style_pvalue, digits = 3)) %>%
bold_p()
```| Characteristic | N | Beta | 95% CI1 | p-value |
|---|---|---|---|---|
| Gender | 150 | |||
| Female | — | — | ||
| Male | 0.23 | -0.05, 0.52 | 0.108 | |
| Age (year) | 150 | 0.01 | -0.04, 0.06 | 0.609 |
| Employment | 150 | |||
| Unemployed | — | — | ||
| Employed | 0.02 | -0.26, 0.31 | 0.884 | |
| BMI_Pre | 150 | -0.02 | -0.04, 0.01 | 0.240 |
| Physical Activity (total hour per week) | 150 | -0.31 | -0.35, -0.28 | <0.001 |
| 1 CI = Confidence Interval | ||||
```{r}
library(broom)
slm_PA_HW %>%
tidy()
``````{r}
mlm_bmichanges <- lm(BMI_Changes ~ Gender + BMI_Pre + PA_HW, asthmads)
summary(mlm_bmichanges)
plot(mlm_bmichanges)
```
Call:
lm(formula = BMI_Changes ~ Gender + BMI_Pre + PA_HW, data = asthmads)
Residuals:
Min 1Q Median 3Q Max
-1.24462 -0.33420 0.01937 0.32699 1.30192
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.631072 0.232216 -7.024 7.59e-11 ***
GenderMale 0.109515 0.086543 1.265 0.208
BMI_Pre -0.010316 0.008107 -1.273 0.205
PA_HW -0.311135 0.018787 -16.561 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5168 on 146 degrees of freedom
Multiple R-squared: 0.6632, Adjusted R-squared: 0.6563
F-statistic: 95.83 on 3 and 146 DF, p-value: < 2.2e-16
gtsummary::```{r}
mlm_bmichanges %>%
tbl_regression(pvalue_fun = partial(style_pvalue, digits = 3)) %>%
bold_p()
```| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| Gender | |||
| Female | — | — | |
| Male | 0.11 | -0.06, 0.28 | 0.208 |
| BMI_Pre | -0.01 | -0.03, 0.01 | 0.205 |
| Physical Activity (total hour per week) | -0.31 | -0.35, -0.27 | <0.001 |
| 1 CI = Confidence Interval | |||
Objective: Find significant predictors of BMI_CCat (Changes in category) - PA_HW convert to PA_Cat, with cut off >= 2 for low and high - BMI_Changes convert to BMI_CCat, with cut off >= -2.5 for no and yes
asthmads_clean.sav```{r}
asthmads
``````{r}
asthmads <- asthmads %>%
mutate(PA_Cat = cut(PA_HW,
breaks = c(-Inf, 2.0, Inf),
labels = c("Low Intensity", "High Intensity")),
BMI_CCat = cut(BMI_Changes,
breaks = c(-Inf, -2.50, Inf),
labels = c("Marked Changed", "Min Changed")),
BMI_CCat = fct_relevel(BMI_CCat, "Min Changed", "Marked Changed"))
asthmads
``````{r}
slogm_PA <- glm(BMI_CCat ~ PA_Cat,
family = binomial(),
asthmads)
summary(slogm_PA)
tidy(slogm_PA, conf.int = T, exponentiate = T)
```
Call:
glm(formula = BMI_CCat ~ PA_Cat, family = binomial(), data = asthmads)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5691 0.2285 -2.491 0.0127 *
PA_CatHigh Intensity 2.3096 0.4120 5.606 2.07e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 204.09 on 149 degrees of freedom
Residual deviance: 165.07 on 148 degrees of freedom
AIC: 169.07
Number of Fisher Scoring iterations: 4
```{r}
plot(slogm_PA)
```broom:: package
augment(_) to calculate residual and cook distance```{r}
library(broom)
slogm_PA %>%
augment() %>%
bind_cols(idR = asthmads$idR, .)
``````{r}
slogm_PA %>%
augment() %>%
mutate(abs_stdresid = abs(.std.resid)) %>%
slice_max(order_by = abs_stdresid, n = 5)
``````{r}
slogm_PA %>%
augment() %>%
bind_cols(idR = asthmads$idR, .) %>%
slice_max(order_by = .cooksd, n = 5)
```hoslem.test from ResourceSelection:: package```{r}
library(ResourceSelection)
hoslem.test(x = as.numeric(asthmads$BMI_CCat)-1,
y = slogm_PA$fitted.values,
g = 9)
```Warning in hoslem.test(x = as.numeric(asthmads$BMI_CCat) - 1, y =
slogm_PA$fitted.values, : The data did not allow for the requested number of
bins.
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(asthmads$BMI_CCat) - 1, slogm_PA$fitted.values
X-squared = 1.1517e-26, df = 0, p-value < 2.2e-16
```{r}
library(pROC)
slogm_PA %>%
augment(type.predict = "response") %>%
roc(BMI_CCat, .fitted)
```Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
Call:
roc.data.frame(data = ., response = BMI_CCat, predictor = .fitted)
Data: .fitted in 63 controls (BMI_CCat Min Changed) < 87 cases (BMI_CCat Marked Changed).
Area under the curve: 0.7482
```{r}
slogm_PA %>%
augment(type.predict = "response") %>%
roc(BMI_CCat, .fitted) %>%
plot()
```Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
we can also plot manually with ggplot
```{r}
slogm_PA_roc <- slogm_PA %>%
augment(type.predict = "response") %>%
roc(BMI_CCat, .fitted)
```Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
```{r}
tibble(threshold = slogm_PA_roc$thresholds,
sensitivity = slogm_PA_roc$sensitivities,
specificity = slogm_PA_roc$specificities)
``````{r}
library(caret)
slog_PA_predds <- slogm_PA %>%
augment(type.predict = "response") %>%
mutate(predicted = cut(.fitted,
breaks = c(-Inf, .50, Inf),
labels = c("Min Changed", "Marked Changed"))) %>%
select(observed = BMI_CCat, predicted)
slog_PA_predds
``````{r}
slog_PA_ctab <- with(slog_PA_predds, table(observed, predicted))
slog_PA_ctab
``` predicted
observed Min Changed Marked Changed
Min Changed 53 10
Marked Changed 30 57
```{r}
confusionMatrix(slog_PA_ctab, positive = "Marked Changed")
```Confusion Matrix and Statistics
predicted
observed Min Changed Marked Changed
Min Changed 53 10
Marked Changed 30 57
Accuracy : 0.7333
95% CI : (0.6551, 0.8022)
No Information Rate : 0.5533
P-Value [Acc > NIR] : 4.173e-06
Kappa : 0.4756
Mcnemar's Test P-Value : 0.002663
Sensitivity : 0.8507
Specificity : 0.6386
Pos Pred Value : 0.6552
Neg Pred Value : 0.8413
Prevalence : 0.4467
Detection Rate : 0.3800
Detection Prevalence : 0.5800
Balanced Accuracy : 0.7447
'Positive' Class : Marked Changed
gtsummary::```{r}
asthmads %>%
tbl_uvregression(method = glm,
method.args = list(family = binomial),
y = BMI_CCat,
include = PA_Cat,
exponentiate = T)
```| Characteristic | N | OR1 | 95% CI1 | p-value |
|---|---|---|---|---|
| PA_Cat | 150 | |||
| Low Intensity | — | — | ||
| High Intensity | 10.1 | 4.65, 23.6 | <0.001 | |
| 1 OR = Odds Ratio, CI = Confidence Interval | ||||
```{r}
asthmads %>%
tbl_uvregression(method = glm,
method.args = list(family = binomial),
y = BMI_CCat,
include = c(Gender, Age, WorkStatus, BMI_Pre,
PA_HW, PA_Cat),
exponentiate = T) %>%
bold_p(t = .25)
```| Characteristic | N | OR1 | 95% CI1 | p-value |
|---|---|---|---|---|
| Gender | 150 | |||
| Female | — | — | ||
| Male | 0.51 | 0.26, 0.99 | 0.050 | |
| Age (year) | 150 | 0.98 | 0.87, 1.09 | 0.6 |
| Employment | 150 | |||
| Unemployed | — | — | ||
| Employed | 1.04 | 0.55, 2.01 | 0.9 | |
| BMI_Pre | 150 | 1.06 | 0.99, 1.13 | 0.078 |
| Physical Activity (total hour per week) | 150 | 2.60 | 1.92, 3.73 | <0.001 |
| PA_Cat | 150 | |||
| Low Intensity | — | — | ||
| High Intensity | 10.1 | 4.65, 23.6 | <0.001 | |
| 1 OR = Odds Ratio, CI = Confidence Interval | ||||
```{r}
mlogm_bmic <- glm(BMI_CCat ~ Gender + Age + PA_HW,
family = binomial(),
asthmads)
mlogm_bmic %>%
tbl_regression(exponentiate = T,
estimate_fun = partial(style_ratio, digits = 3),
pvalue_fun = partial(style_pvalue, digits = 2)) %>%
bold_p()
```| Characteristic | OR1 | 95% CI1 | p-value |
|---|---|---|---|
| Gender | |||
| Female | — | — | |
| Male | 0.541 | 0.231, 1.242 | 0.15 |
| Age (year) | 1.009 | 0.877, 1.162 | 0.90 |
| Physical Activity (total hour per week) | 2.607 | 1.919, 3.764 | <0.001 |
| 1 OR = Odds Ratio, CI = Confidence Interval | |||
```{r}
plot(mlogm_bmic)
```residual
```{r}
mlogm_bmic %>%
augment() %>%
bind_cols(id = asthmads$id, .) %>%
mutate(abs_stdresid = abs(.std.resid)) %>%
slice_max(order_by = abs_stdresid, n = 5)
```influential
```{r}
mlogm_bmic %>%
augment() %>%
bind_cols(id = asthmads$id, .) %>%
slice_max(order_by = .cooksd, n = 5)
```HL Test
```{r}
library(ResourceSelection)
hoslem.test(x = as.numeric(asthmads$BMI_CCat)-1,
y = mlogm_bmic$fitted.values,
g = 10)
```
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(asthmads$BMI_CCat) - 1, mlogm_bmic$fitted.values
X-squared = 8.8943, df = 8, p-value = 0.3513
ROC Curve
```{r}
library(pROC)
mlogm_bmic %>%
augment(type.predict = "response") %>%
roc(BMI_CCat, .fitted)
```Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
```{r}
mlogm_bmic %>%
augment(type.predict = "response") %>%
roc(BMI_CCat, .fitted) %>%
plot()
```Setting levels: control = Min Changed, case = Marked Changed
Setting direction: controls < cases
Call:
roc.data.frame(data = ., response = BMI_CCat, predictor = .fitted)
Data: .fitted in 63 controls (BMI_CCat Min Changed) < 87 cases (BMI_CCat Marked Changed).
Area under the curve: 0.8574
Confusion Matrix
```{r}
library(caret)
mlogm_bmic_predds <- mlogm_bmic %>%
augment(type.predict = "response") %>%
mutate(predicted = cut(.fitted,
breaks = c(-Inf, .50, Inf),
labels = c("Min Changed", "Marked Changed"))) %>%
select(observed = BMI_CCat, predicted)
mlogm_bmic_ctab <- with(mlogm_bmic_predds, table(observed, predicted))
confusionMatrix(mlogm_bmic_ctab, positive = "Marked Changed")
```Confusion Matrix and Statistics
predicted
observed Min Changed Marked Changed
Min Changed 48 15
Marked Changed 18 69
Accuracy : 0.78
95% CI : (0.7051, 0.8435)
No Information Rate : 0.56
P-Value [Acc > NIR] : 1.506e-08
Kappa : 0.5514
Mcnemar's Test P-Value : 0.7277
Sensitivity : 0.8214
Specificity : 0.7273
Pos Pred Value : 0.7931
Neg Pred Value : 0.7619
Prevalence : 0.5600
Detection Rate : 0.4600
Detection Prevalence : 0.5800
Balanced Accuracy : 0.7744
'Positive' Class : Marked Changed